# Day 15: Advanced Tokenizer
The tokenizer is the translator between text and the model. Without proper tokenization, even the best model cannot function correctly. Today we cover advanced tokenizer topics frequently encountered in practice.
## encode/decode and Special Tokens
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Hello, how are you?"

# encode: text -> list of token IDs
token_ids = tokenizer.encode(text)
print(f"Token IDs: {token_ids}")
# [101, 7592, 1010, 2129, 2024, 2017, 1029, 102]
# 101=[CLS], 102=[SEP] special tokens are automatically added

# decode: token IDs -> text
decoded = tokenizer.decode(token_ids)
print(f"Decoded: {decoded}")

# encode without special tokens
token_ids_no_special = tokenizer.encode(text, add_special_tokens=False)
print(f"Without special tokens: {token_ids_no_special}")

# Inspect individual tokens
tokens = tokenizer.tokenize(text)
print(f"Token list: {tokens}")
# ['hello', ',', 'how', 'are', 'you', '?']
```
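When you decode model output for display, you usually want the special tokens stripped. `decode()` accepts a `skip_special_tokens` flag for exactly this. A minimal sketch, reusing the same `bert-base-uncased` tokenizer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
token_ids = tokenizer.encode("Hello, how are you?")

# Default decode keeps [CLS] and [SEP] in the output string
print(tokenizer.decode(token_ids))

# skip_special_tokens=True drops them - usually what you want
# when showing generated text to a user
print(tokenizer.decode(token_ids, skip_special_tokens=True))
```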
## Padding and Truncation Strategies
When processing batches, all inputs must have the same length. Padding extends short inputs, and truncation clips long inputs.
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

sentences = [
    "A short sentence",
    "This is a somewhat longer sentence",
    "This is a very very very long sentence to demonstrate the difference between padding and truncation",
]

# Batch tokenization - apply padding and truncation simultaneously
encoded = tokenizer(
    sentences,
    padding=True,         # Pad to the longest sentence in the batch
    truncation=True,      # Truncate if exceeding max_length
    max_length=20,        # Maximum number of tokens
    return_tensors="pt",  # Return as PyTorch tensors
)

print(f"input_ids shape: {encoded['input_ids'].shape}")
print(f"attention_mask shape: {encoded['attention_mask'].shape}")

# attention_mask: 1 means real token, 0 means padding
# Read the padded length from the tensor instead of hard-coding 20:
# with padding=True the batch width is the longest (truncated) sentence
batch_length = encoded["input_ids"].shape[1]
for i, sent in enumerate(sentences):
    real_tokens = encoded["attention_mask"][i].sum().item()
    print(f"Sentence {i+1}: {real_tokens} real tokens, {batch_length - real_tokens} padding tokens")
```
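Note that `padding=True` pads relative to the current batch, so different batches can end up with different widths. When you need fixed shapes (for example for ONNX export or static-shape accelerators), `padding="max_length"` pads every input to `max_length` regardless of batch contents. A small sketch, again with `bert-base-uncased`:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

encoded = tokenizer(
    ["A short sentence", "A slightly longer example sentence"],
    padding="max_length",  # always pad to max_length, not to the batch maximum
    truncation=True,
    max_length=16,
)

# Every row now has exactly 16 IDs, independent of the batch contents
print([len(ids) for ids in encoded["input_ids"]])
# [16, 16]
```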
## Building Chat Format with chat_template
Modern instruction-tuned LLMs expect conversations in a chat format (system/user/assistant roles). Calling `apply_chat_template()` automatically applies the correct model-specific format.
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a friendly AI assistant."},
    {"role": "user", "content": "Tell me 3 advantages of Python."},
]

# Convert chat format to the model-specific prompt format
formatted = tokenizer.apply_chat_template(
    messages,
    tokenize=False,              # Return as string (True returns token IDs)
    add_generation_prompt=True,  # Add assistant response start tag
)
print(formatted)

# Tokenize in one step
input_ids = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
)
print(f"Token count: {input_ids.shape[-1]}")
```
Each model ships a different `chat_template`. Llama uses `<|begin_of_text|>` tags, while the ChatML format uses `<|im_start|>` tags. Using `apply_chat_template()` means you do not need to worry about these differences.
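To make the template idea concrete, here is a hand-rolled sketch of the ChatML layout. The `chatml_format` helper is hypothetical and for illustration only; real models ship their template as a Jinja string in `tokenizer.chat_template`, so in practice you should always rely on `apply_chat_template()`:

```python
def chatml_format(messages, add_generation_prompt=True):
    """Illustrative sketch of the ChatML layout: each turn is wrapped in
    <|im_start|>{role} ... <|im_end|> markers."""
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

messages = [
    {"role": "system", "content": "You are a friendly AI assistant."},
    {"role": "user", "content": "Tell me 3 advantages of Python."},
]
print(chatml_format(messages))
```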
## Today’s Exercises
- Tokenize the same English sentence with the `gpt2` and `bert-base-uncased` tokenizers, then compare the token count and tokenization approach differences.
- Batch-tokenize 5 sentences of different lengths with `padding="max_length"` and `max_length=32`, then calculate the padding ratio in the `attention_mask` of each sentence.
- Compare the `apply_chat_template()` results of 2 models (e.g., Llama, Mistral) to see how the same conversation is formatted differently for each model.